This is a clear example of exploration-then-exploitation behaviour, with exactly one phase change in the process.


Reviews: Explicit Planning for Efficient Exploration in Reinforcement Learning

Neural Information Processing Systems

This paper introduces the interesting idea of demand matrices as a way to perform pure exploration more efficiently. A demand matrix simply specifies the minimum number of times each state-action pair must be visited. The remaining demand is then treated as an additional part of the state in an augmented MDP, which can be solved to derive the optimal exploration strategy for satisfying the specified initial demand. While the idea is interesting and solid, there are downsides to the approach itself, and some of the analysis in the paper could be improved. In particular, there are no theoretical guarantees that the algorithm still works when it is run alongside a learned model.
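The augmented-MDP construction described above can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a tiny deterministic chain MDP (the `step`, `demand`, and `plan` names are hypothetical), and it uses breadth-first search over states of the form (MDP state, remaining demand) to find a shortest action sequence that drives every demand to zero. The paper's setting is more general, but the sketch shows why augmenting the state with remaining demands turns pure exploration into an ordinary planning problem.

```python
from collections import deque

# Toy deterministic 3-state chain MDP: action 0 moves left, action 1 moves right.
N_STATES, N_ACTIONS = 3, 2

def step(s, a):
    return max(s - 1, 0) if a == 0 else min(s + 1, N_STATES - 1)

# Demand matrix: minimum required visit count for each (state, action) pair.
demand = ((1, 1), (1, 1), (1, 1))

def plan(start=0):
    """BFS over the augmented MDP whose state is (mdp_state, remaining_demand).

    Returns a shortest action sequence after which every demand is zero.
    """
    init = (start, demand)
    frontier = deque([(init, [])])
    seen = {init}
    while frontier:
        (s, rem), actions = frontier.popleft()
        if all(c == 0 for row in rem for c in row):
            return actions
        for a in range(N_ACTIONS):
            # Taking action a in state s decrements that pair's remaining demand.
            new_rem = tuple(
                tuple(max(c - 1, 0) if (i, j) == (s, a) else c
                      for j, c in enumerate(row))
                for i, row in enumerate(rem))
            nxt = (step(s, a), new_rem)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [a]))
    return None  # demand unsatisfiable from this start state
```

Because each action can decrement at most one demand entry, the total demand count is a lower bound on plan length, and BFS in the deterministic case attains it; with a learned or stochastic model, as the review notes, no such guarantee is available.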